The ability of sound-source localization in sagittal planes (along the top-down and 
front-back dimension) varies considerably across listeners. The directional acoustic 
spectral features, described by head-related transfer functions (HRTFs), also vary 
considerably across listeners, a consequence of the listener-specific shape of the ears. 
It is not clear whether the differences in localization ability result from differences in 
the encoding of directional information provided by the HRTFs, i.e., an acoustic factor, or 
from differences in auditory processing of those cues (e.g., spectral-shape sensitivity), i.e., 
non-acoustic factors. We addressed this issue by analyzing the listener-specific localization 
ability in terms of localization performance. Directional responses to spatially distributed 
broadband stimuli from 18 listeners were used. A model of sagittal-plane localization was 
fit individually for each listener by considering the actual localization performance, the 
listener-specific HRTFs representing the acoustic factor, and an uncertainty parameter 
representing the non-acoustic factors. The model was configured to simulate the condition 
of complete calibration of the listener to the tested HRTFs. Listener-specifically calibrated 
model predictions yielded correlations of, on average, 0.93 with the actual localization 
performance. Then, the model parameters representing the acoustic and non-acoustic 
factors were systematically permuted across the listener group. While the permutation 
of HRTFs affected the localization performance, the permutation of listener-specific 
uncertainty had a substantially larger impact. Our findings suggest that across-listener 
variability in sagittal-plane localization ability is only marginally determined by the acoustic 
factor, i.e., the quality of directional cues found in typical human HRTFs. Rather, 
the non-acoustic factors, supposed to represent the listeners' efficiency in processing 
directional cues, appear to be important. 

Keywords: sound localization, localization model, sagittal plane, listener-specific factors, head-related transfer 
functions 



1. INTRODUCTION 

Human listeners use monaural spectral cues to localize sound 
sources in sagittal planes (e.g., Wightman and Kistler, 1997; van 
Wanrooij and van Opstal, 2005). This includes the ability to assign 
the vertical position of the source (e.g., Vliegen and van Opstal, 
2004) and to distinguish between front and back (e.g., Zhang and 
Hartmann, 2010). Spectral cues are caused by the acoustic filter- 
ing of the torso, head, and pinna, and can be described by means 
of head-related transfer functions (HRTFs; e.g., Moller et al., 
1995). The direction-dependent components of the HRTFs are 
described by directional transfer functions (DTFs, Middlebrooks, 
1999b). 

The ability to localize sound sources in sagittal planes, usually 
tested in psychoacoustic experiments as localization performance, 
varies largely across listeners (Middlebrooks, 1999a; Rakerd et al., 
1999; Zhang and Hartmann, 2010). A factor contributing to the 
variability across listeners might be the listeners' morphology. 
The ear shape varies across the human population (Algazi et al., 
2001) and these differences cause the DTF features to vary across
individuals (Wightman and Kistler, 1997). One might expect that
different DTF sets provide different amounts of cues available for 
the localization of a sound. When listening with DTFs of other 
listeners, the performance might be different, an effect we refer to 
in this study as the acoustic factor in sound localization. 

The strong effect of training on localization performance
(Majdak et al., 2010, Figure 7) indicates that, in addition to the
acoustic factor, other listener-specific factors are also involved.
For example, a link between the listener-specific sensitivity to
the spectral envelope shape and the listener-specific localization
performance has recently been shown (Andeol et al., 2013).
However, other factors such as the ability to perform the experimental
task, the attention paid to the relevant cues, or the accuracy
in responding might contribute as well. In the present study, we
consolidate all those factors into a single factor, which we refer to as
the non-acoustic factor.

In this study, we are interested in the contribution of 
the acoustic and non-acoustic factors to sound localization 
performance. As for the acoustic factor, its effect on localization 
performance has already been investigated in many studies (e.g.,
Wightman and Kistler, 1997; Middlebrooks, 1999a; Langendijk
and Bronkhorst, 2002). However, most of those studies investigated
ad-hoc listening with modified DTFs without any re-calibration
of the spectral-to-spatial mapping in the auditory
system (Hofman et al., 1998). By testing the ad-hoc localization
performance to modified DTFs, two factors were simultaneously 
varied: the directional cues in the incoming sound, and their 
mismatch to the familiarized (calibrated) mapping. The acous- 
tic factor of interest in our study, however, considers changes 
in the DTFs of the own ears, i.e., changes of DTFs without any 
mismatch between the incoming sound and the calibrated map- 
ping. A localization experiment testing such a condition would 
need to minimize the mismatch by achieving a re-calibration. 
Such a re-calibration is indeed achievable through extensive training
with modified DTFs; however, the experimental effort is rather
demanding and requires weeks of exposure to the modified cues
(Hofman and van Opstal, 1998; Majdak et al., 2013). Note that
such a long-term re-calibration is usually attributed to percep- 
tual adaptation, in contrast to the short-term learning found to 
take place within hours (Zahorik et al., 2006; Parseihian and Katz, 
2012). 

Using a model of the localization process, the condition of 
a complete re-calibration can be more easily achieved. Thus, 
our study is based on predictions from a model of sagittal- 
plane sound localization (Baumgartner et al., 2013). This model 
assumes that listeners create an internal template set of their 
specific DTFs as a result of a learning process (Hofman et al., 
1998; van Wanrooij and van Opstal, 2005). The more simi- 
lar the representation of the incoming sound compared to a 
template, the larger the assumed probability of responding at 
the polar angle corresponding to that template (Langendijk and 
Bronkhorst, 2002). The model from Baumgartner et al. (2013) 
uses a method to compute localization performance based on 
probabilistic predictions and considers both acoustic factors in 
terms of the listener-specific DTFs and non-acoustic factors in 
terms of an uncertainty parameter U. In Baumgartner et al. 
(2013), the model has been validated under various conditions 
for broadband stationary sounds. In that model, the role of the 
acoustic factor can be investigated by simultaneously modifying 
DTFs of both the incoming sound and the template sets. This configuration
allows predicting sound localization performance when
listening with others' ears following a complete re-calibration to
the tested DTFs.

In the following, we briefly describe the model and revisit the 
listener-specific calibration of the model. Then, the effect of the 
uncertainty representing the non-acoustic factor, and the effect 
of the DTF set representing the acoustic factor, are investigated. 
Finally, the relative contributions of the two factors are compared. 

2. MATERIALS AND METHODS 
2.1. MODEL 

In this study, we used the model proposed by Baumgartner 
et al. (2013). The model relies on a comparison between an 
internal representation of the incoming sound and an internal 
template set (Zakarauskas and Cynader, 1993; Hofman and van 
Opstal, 1998; Langendijk and Bronkhorst, 2002; Baumgartner 
et al., 2013). The internal template set is assumed to be created by 
means of learning the correspondence between the spectral fea- 
tures and the direction of an acoustic event based on feedback 
from other modalities (Hofman et al., 1998; van Wanrooij and
van Opstal, 2005). The model is implemented in the Auditory
Modeling Toolbox as baumgartner2013 (Sondergaard and
Majdak, 2013).

Figure 1 shows the basic structure of the model from 
Baumgartner et al. (2013). Each block represents a processing 
stage of the auditory system in a functional way. The target sound 
is processed in order to obtain an internal (neural) representa- 
tion. This target representation is compared to an equivalently 
processed internal template set consisting of the DTF represen- 
tations for the given sagittal plane. This comparison process is the 
basis of a spectral-to-spatial mapping, which yields the prediction 
probability for responding at a given polar angle. 

In general, in this study, we used the model configured as 
suggested in Baumgartner et al. (2013). In the following, we sum- 
marize the model stages and their configuration, focusing on the 
acoustic and non-acoustic factors in the localization process. 

2.1.1. Peripheral processing

In the model, the same peripheral processing is considered for 
the incoming sound and the template. The peripheral process- 
ing stage aims at modeling the effect of human physiology while 
focusing on directional cues. The effects of the torso, head, and
pinna are considered by filtering the incoming sound with a DTF.

FIGURE 1 | Structure of the sound localization model from
Baumgartner et al. (2013). The incoming target sound is peripherally
processed and the result is compared to an internal template set.
The comparison result is mapped yielding the probability for
responding at a given polar angle. The blue arrows indicate the free
parameters of the corresponding sections. In the model, the DTF
set and the uncertainty represent the acoustic and non-acoustic
factors, respectively.

Cochlear filtering was modeled as a linear
Gammatone filter bank (Patterson et al., 1988). The filter bank
produces a signal for each frequency band. In total, 28 frequency bands
were considered in the model, determined by the lowest frequency
of 0.7 kHz, the highest frequency of 18 kHz, and the frequency 
spacing of the bands corresponding to one equivalent rectangular 
bandwidth (Glasberg and Moore, 1990). In the model, the out- 
put of each frequency band is half-wave rectified and low-pass 
filtered (2nd-order Butterworth filter, cut-off frequency of 1 kHz) 
in order to simulate the effect of the inner hair cells (Dau et al., 
1996). The filtered outputs are then temporally averaged in terms 
of root-mean-square (RMS) amplitude, resulting in the internal 
representation of the sound. 
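
The band layout and the last two steps of this stage can be sketched as follows. This is a minimal illustration, not the toolbox implementation: it assumes the ERB-number formula of Glasberg and Moore (1990) with bands placed one ERB apart starting at the lowest frequency, and it omits the Gammatone filtering and the Butterworth low-pass filter. All function names are hypothetical.

```python
import numpy as np

def erb_number(f_hz):
    """ERB-number scale after Glasberg and Moore (1990); f_hz in Hz."""
    return 21.4 * np.log10(4.37e-3 * f_hz + 1.0)

def erb_number_to_hz(erb):
    """Inverse of erb_number."""
    return (10.0 ** (erb / 21.4) - 1.0) / 4.37e-3

def center_frequencies(f_low=700.0, f_high=18000.0, spacing_erb=1.0):
    """Center frequencies spaced by spacing_erb, starting at f_low (assumed layout)."""
    erb_lo, erb_hi = erb_number(f_low), erb_number(f_high)
    n_bands = int(np.floor((erb_hi - erb_lo) / spacing_erb)) + 1
    return erb_number_to_hz(erb_lo + spacing_erb * np.arange(n_bands))

def half_wave_rectify(band_signals):
    """Keep only the positive half of each band signal."""
    return np.maximum(band_signals, 0.0)

def rms_per_band(band_signals):
    """Temporal RMS per band (rows: bands, columns: samples)."""
    return np.sqrt(np.mean(np.square(band_signals), axis=1))

fc = center_frequencies()
print(len(fc))  # number of bands between 0.7 and 18 kHz at 1-ERB spacing
```

Under these assumptions the layout yields the 28 bands stated in the text; a different placement rule could shift the center frequencies slightly.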

2.1.2. Comparison stage

In the comparison stage, the internal representation of the incom- 
ing sound is compared with the internal template set. Each 
template is selected by a polar angle denoted as template angle. 
A distance metric is calculated as a function of the template angle 
and is interpreted as a descriptor contributing to the prediction 
of the listener's response. 

In the model, the distance metric is represented by the stan- 
dard deviation (SD) of the inter-spectral differences between the 
internal representation of the incoming sound and a template 
calculated across frequency bands. The SD of inter-spectral dif- 
ferences is robust against changes in overall level and has been 
shown to be superior to other metrics like the inter-spectral 
cross-correlation coefficient (Langendijk and Bronkhorst, 2002). 
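
As an illustration of this distance metric, the following sketch computes the SD of inter-spectral differences between one target representation and a set of templates, with all spectra given in dB per frequency band. The function name and the use of the population SD (NumPy's default) are assumptions; the example values are arbitrary.

```python
import numpy as np

def spectral_distance(target_db, templates_db):
    """SD across frequency bands of (target - template), one value per template.

    target_db:    shape (n_bands,), internal representation of the target in dB
    templates_db: shape (n_angles, n_bands), one template per polar angle
    """
    diff = target_db[np.newaxis, :] - templates_db
    return np.std(diff, axis=1)  # population SD (ddof=0), an assumption

# Arbitrary toy spectra: the target equals template 0 plus a 10-dB level offset.
templates = np.array([[0.0, 3.0, 6.0],
                      [6.0, 3.0, 0.0]])
target = templates[0] + 10.0
distance = spectral_distance(target, templates)
print(distance)
```

Because a constant level offset shifts all band differences equally, the distance to the matching template remains zero, which illustrates the robustness against overall level mentioned above.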

2.1.3. Spatial mapping 

In the model, a probabilistic approach is used for the mapping 
of the distance metric to the predicted response probability. For a 
particular target angle, response angle, and ear, the distance met- 
ric is mapped by a Gaussian function to a similarity index (SI), 
interpreted as a measure reflecting the response probability for a 
response angle. 

The mapping function actually reflects the non-acoustic fac- 
tor of the localization process. In the model, the width of the 
Gaussian function was considered as a property of an individ- 
ual listener. Baumgartner et al. (2013) assumed that a listener 
being more precise in the response to the same sound would 
need a more narrow mapping than a less precise listener. Thus, 
the width of the mapping function was interpreted as a listener- 
specific uncertainty, U. In the model, it accounted for listener- 
specific localization performance and was a free parameter in 
the calibration process. Langendijk and Bronkhorst (2002) also
used an uncertainty parameter (their S); however, they considered
it to be constant for all listeners, thus
representing a rather general property of the auditory system.
The impact of the uncertainty U, representing the non-acoustic
factor responsible for the listener variability, on the predicted
localization performance is described in the following sections.
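
The role of U as the width of the mapping can be sketched as follows. The code assumes an unnormalized Gaussian centered at zero distance with U as its standard deviation; the exact functional form used in the model may differ, and the function name is hypothetical.

```python
import numpy as np

def similarity_index(distance, uncertainty):
    """Map a spectral distance to a similarity index with a Gaussian of width U.

    Assumed form: exp(-d^2 / (2 U^2)), i.e., SI = 1 for a perfect match.
    """
    distance = np.asarray(distance, dtype=float)
    return np.exp(-0.5 * (distance / uncertainty) ** 2)

d = 2.0  # arbitrary example distance
si_precise = similarity_index(d, uncertainty=1.0)    # narrow mapping
si_imprecise = similarity_index(d, uncertainty=2.0)  # wide mapping
print(si_precise, si_imprecise)
```

With a small U, the similarity drops quickly as the distance grows, so only well-matching templates receive a high index; with a large U, poorly matching templates remain plausible response candidates.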

In the model, the contribution of the two ears was consid- 
ered by applying a binaural weighting function (Morimoto, 2001; 
Macpherson and Sabin, 2007), which reduces the contribution 
of the contralateral ear with increasing lateral angle of the target
sound. The binaural weighting function is applied to each
monaural SI, and the sum of the weighted monaural SIs yields
the binaural SI.

In the model, for a given target angle, the binaural SIs are calculated
as a function of the response angle, i.e., for all templates.
The SI as a function of response angle is scaled to a sum of one in 
order to be interpreted as a probability mass vector (PMV), i.e., 
a discrete version of a probability density function. Such a PMV 
describes the listener's response probability as a function of the 
response angle for a given incoming sound. 
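
The binaural combination and the normalization to a PMV can be sketched as follows. The sigmoid shape of the weighting function, its slope value, and the sign convention (positive lateral angles toward the left ear) are illustrative assumptions, since the text only states that the contralateral contribution decreases with increasing lateral angle. Function names are hypothetical.

```python
import numpy as np

def binaural_weights(lateral_deg, slope_deg=13.0):
    """Return (w_left, w_right), summing to one.

    Assumed sigmoid weighting; slope_deg is a placeholder value.
    Positive lateral angles are assumed to point toward the left ear.
    """
    w_left = 1.0 / (1.0 + np.exp(-lateral_deg / slope_deg))
    return w_left, 1.0 - w_left

def pmv_from_si(si_left, si_right, lateral_deg):
    """Weight the monaural SIs, sum them, and scale the result to a sum of one."""
    w_left, w_right = binaural_weights(lateral_deg)
    si_binaural = w_left * np.asarray(si_left) + w_right * np.asarray(si_right)
    return si_binaural / np.sum(si_binaural)

# Arbitrary monaural SIs for three response angles.
pmv = pmv_from_si([0.9, 0.5, 0.1], [0.7, 0.6, 0.2], lateral_deg=30.0)
print(pmv, pmv.sum())
```

The scaling step is what allows the SI curve to be read as a response probability for each candidate polar angle.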

2.2. EXPERIMENTAL CONDITIONS FOR CALIBRATION 

In Baumgartner et al. (2013), the model was calibrated to the 
actual performance of a pool of listeners for the so-called base- 
line condition, for which actual data (DTFs and localization 
responses) were collected in two studies, namely in Goupell 
et al. (2010) and Majdak et al. (2013). In both studies, local- 
ization responses were collected using virtual stimuli presented 
via headphones. While localization performance seems to be 
better when using free-field stimuli presented via loudspeakers 
(Middlebrooks, 1999b), we used virtual stimuli in order to better 
control for cues like head movements, loudspeaker equalization, 
or room reflections. In this section, we summarize the methods 
used to obtain the baseline conditions in those two studies. 

2.2.1. Subjects 

In total, 18 listeners were considered for the calibration. Eight 
listeners were from Goupell et al. (2010) and 13 listeners were 
from Majdak et al. (2013), i.e., three listeners participated in 
both studies. None of them had indications of hearing disorders. 
All of them had thresholds of 20-dB hearing level or lower at 
frequencies from 0.125 to 12.5 kHz. 

2.2.2. HRTFs and DTFs 

In both Goupell et al. (2010) and Majdak et al. (2013), HRTFs 
were measured individually for each listener. The DTFs were then 
calculated from the HRTFs. Both HRTFs and DTFs are part of the 
ARI HRTF database (Majdak et al., 2010).

Twenty-two loudspeakers (custom-made boxes with VIFA 10 
BGS as drivers) were mounted on a vertical circular arc at fixed 
elevations from −30° to 80°, with a 10° spacing between 70°
and 80° and 5° spacing elsewhere. The listener was seated in 
the center point of the circular arc on a computer-controlled 
rotating chair. The distance between the center point and each 
speaker was 1.2 m. Microphones (Sennheiser KE-4-211-2) were 
inserted into the listener's ear canals and their output signals were 
directly recorded via amplifiers (FP-MP1, RDL) by the digital 
audio interface. 

A 1729-ms exponential frequency sweep from 0.05 to 20 kHz 
was used to measure each HRTF. To speed up the measurement, 
for each azimuth, the multiple exponential sweep method was 
used (Majdak et al., 2007). At an elevation of 0°, the HRTFs were 
measured with a horizontal spacing of 2.5° within the range of 
±45° and 5° otherwise. With this rule, the measurement posi- 
tions for other elevations were distributed with a constant spatial 
angle, i.e., the horizontal angular spacing increased with the ele- 
vation. In total, HRTFs for 1550 positions within the full 360° 
horizontal span were measured for each listener. The measurement
procedure lasted for approximately 20 min. The acoustic
influence of the equipment was removed by equalizing the HRTFs
with the transfer functions of the equipment. The equipment 
transfer functions were derived from reference measurements 
in which the microphones were placed at the center point of 
the circular arc and the measurements were performed for all 
loudspeakers. 

The DTFs (Middlebrooks, 1999b) were then calculated as follows. The magnitude
of the common transfer function (CTF) was calculated by
averaging the log-amplitude spectra of all HRTFs for each indi- 
vidual listener and ear. The phase spectrum of the CTF was set to 
the minimum phase corresponding to the amplitude spectrum. 
The DTFs were the result of filtering HRTFs with the inverse 
complex CTF. Finally, the impulse responses of all DTFs were win- 
dowed with an asymmetric Tukey window (fade in of 0.5 ms and 
fade out of 1 ms) to a 5.33-ms duration. 
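
The magnitude part of this DTF calculation can be sketched as follows. The sketch works on magnitude spectra only: the minimum-phase CTF and the Tukey windowing of the impulse responses are omitted, and the function name and toy values are hypothetical.

```python
import numpy as np

def dtf_magnitudes_db(hrtf_mag):
    """Split HRTF magnitudes of one ear into CTF and DTFs (in dB).

    hrtf_mag: shape (n_directions, n_freqs), linear magnitudes
    Returns (dtf_db, ctf_db). The CTF is the average of the log-amplitude
    spectra across directions; dividing by the CTF becomes a subtraction in dB.
    """
    hrtf_db = 20.0 * np.log10(hrtf_mag)
    ctf_db = np.mean(hrtf_db, axis=0)
    return hrtf_db - ctf_db, ctf_db

# Arbitrary toy magnitudes: two directions, two frequency bins.
hrtf_mag = np.array([[1.0, 2.0],
                     [4.0, 8.0]])
dtf_db, ctf_db = dtf_magnitudes_db(hrtf_mag)
print(dtf_db)
```

By construction, the DTF magnitudes average to 0 dB across directions in every frequency bin, so only the direction-dependent part of the HRTFs remains.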

2.2.3. Stimulus 

In Majdak et al. (2013), the experiments were performed for 
targets in the lateral range of ±60°. In Goupell et al. (2010), 
the experiments were performed for targets in the lateral range 
of ±10°. The direction of a target is described by the polar angle 
ranging from −30° (front, below eye-level) to 210° (rear, below
eye-level). 

The audio stimuli were Gaussian white noise bursts with a 
duration of 500 ms, which were filtered with the listener-specific 
DTFs corresponding to the tested condition. The level of the stim- 
uli was 50 dB above the individually measured absolute detection 
threshold for that stimulus, estimated in a manual up-down pro- 
cedure for a frontal eye-leveled position. In the experiments, the 
stimulus level was randomly roved for each trial within the range 
of ±5 dB in order to reduce the possibility of using overall level 
cues for localization. 
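
A level-roved noise burst of this kind can be sketched as follows. The sketch assumes a uniform distribution of the roving within ±5 dB and uses the 48-kHz sampling rate of the apparatus; the DTF filtering and the calibration to the individual detection threshold are omitted, and the function name is hypothetical.

```python
import numpy as np

def noise_burst(rng, fs=48000, duration_s=0.5, rove_db=5.0):
    """Gaussian white noise burst with a random level rove.

    Returns the scaled burst and the applied rove in dB.
    """
    n_samples = int(round(fs * duration_s))
    burst = rng.standard_normal(n_samples)
    level_db = rng.uniform(-rove_db, rove_db)  # uniform roving is an assumption
    return burst * 10.0 ** (level_db / 20.0), level_db

rng = np.random.default_rng(0)  # fixed seed only for reproducibility
burst, level_db = noise_burst(rng)
print(burst.shape, level_db)
```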

2.2.4. Apparatus 

In both studies, Goupell et al. (2010) and Majdak et al. (2013), 
the virtual acoustic stimuli were presented via headphones (HD 
580, Sennheiser) in a semi-anechoic room. Stimuli were generated 
using a computer and output via a digital audio interface (ADI-8, 
RME) with a 48-kHz sampling rate. A virtual visual environment 
was presented via a head-mounted display (3-Scope, Trivisio). It 
provided two screens with a field of view of 32° × 24° (horizontal
× vertical dimension). The virtual visual environment was presented
binocularly with the same picture for both eyes. A tracking
sensor (Flock of Birds, Ascension), mounted on the top of the lis- 
tener's head, captured the position and orientation of the head 
in real time. A second tracking sensor was mounted on a manual 
pointer. The tracking data were used for the 3-D graphic render- 
ing and response acquisition. More details about the apparatus 
are provided in Majdak et al. (2010). 

2.2.5. Procedure 

For the calibration, the data were collected in two studies using 
the same procedure. In Goupell et al. (2010), the data were the 
last 300 trials collected within the acoustic training, see their Sec. 
II. D. In Majdak et al. (2013), the data were the 300 trials col- 
lected within the acoustic test performed at the beginning of the 
pre-training experiments, see their Sec. II. D. In the following, we 
summarize the procedure used in the two studies. 



In both studies, the listeners were immersed in a spherical virtual
visual environment (for more details see Majdak et al., 2010).
They were standing on a platform and held a pointer in their 
right hand. The projection of the pointer direction on the sphere's 
surface, calculated based on the position and orientation of the 
tracker sensors, was visualized and recorded as the perceived tar- 
get position. The pointer was visualized whenever it was in the 
listeners' field of view. 

Prior to the acoustic tests, listeners participated in a visual 
training procedure with the goal to train them to point accu- 
rately to the target. The visual training was a simplified game in 
the first-person perspective in which listeners had to find a visual 
target, point at it, and click a button within a limited time period. 
This training was continued until 95% of the targets were found 
with an RMS angular error smaller than 2°. This performance was
reached within a few hundred trials. 

In the acoustic experiments, at the beginning of each trial, the 
listeners were asked to align themselves with the reference posi- 
tion, keep the head direction constant, and click a button. Then, 
the stimulus was presented. The listeners were asked to point to 
the perceived stimulus location and click the button again. Then, 
a visual target in the form of a red rotating cube was shown at 
the position of the acoustic target. In cases where the target was 
outside of the field of view, arrows pointed towards its position. 
The listeners were asked to find the target, point at it, and click 
the button. At this point in the procedure, the listeners had both 
heard the acoustic target and seen the visualization of its position. 
To stress the link between visual and acoustic location, the listen- 
ers were asked to return to the reference position and listen to the 
same acoustic target once more. The visual feedback was intended 
to trigger a procedural training in order to improve the localization
performance within the first few hundred trials (Majdak
et al., 2010). During this second acoustic presentation, the visual
target remained visualized in the visual environment. Then, while 
the target was still visualized, the listeners had to point at the tar- 
get and click the button again. An experimental block consisted 
of 50 targets and lasted for approximately 15 min. 

2.3. DATA ANALYSIS 

In the psychoacoustic experiments, the errors were calculated 
by subtracting the target angles from the response angles. We 
separated our data analysis into confusions between the hemi- 
fields and the local performance within the correct hemifield. The 
rate of confusions was represented by the quadrant error (QE), 
which is the percentage of responses where the absolute polar 
error exceeded 90° (Middlebrooks, 1999b). In order to quantify 
the local performance in the polar dimension, the local polar 
RMS error (PE) was calculated, i.e., the RMS of the polar errors 
calculated for the data without QEs. 
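
The two error metrics can be sketched for a set of response and target polar angles as follows. Wrapping the polar errors to the range of ±180° is an assumption made here for angles on a full circle; the function names and the toy data are hypothetical.

```python
import numpy as np

def wrap_deg(angle):
    """Wrap angles in degrees to the interval [-180, 180)."""
    return (np.asarray(angle, dtype=float) + 180.0) % 360.0 - 180.0

def quadrant_and_polar_error(target_polar, response_polar):
    """Return (QE in percent, local polar RMS error PE in degrees)."""
    err = wrap_deg(np.asarray(response_polar) - np.asarray(target_polar))
    is_qe = np.abs(err) > 90.0            # quadrant errors
    qe = 100.0 * np.mean(is_qe)
    pe = np.sqrt(np.mean(err[~is_qe] ** 2))  # RMS over responses without QEs
    return qe, pe

# Arbitrary toy data: one of four responses is a front-back confusion.
targets = [0.0, 30.0, 180.0, 90.0]
responses = [10.0, 20.0, 10.0, 90.0]
qe, pe = quadrant_and_polar_error(targets, responses)
print(qe, pe)
```

Separating the two metrics prevents a few hemifield confusions from dominating the estimate of local accuracy.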

The listener-specific results from both Goupell et al. (2010) 
and Majdak et al. (2013) were pooled. Only responses within the 
lateral range of ±30° were considered because (1) most of the 
localization responses were given in that range, (2) Baumgartner 
et al. (2013) evaluated the model using only that range, and (3) 
recent evaluations indicate that predictions for that range seem 
to be slightly more accurate than those for more lateral ranges 
(Baumgartner et al., 2014). For the considered data, the average 
QE was 9.3% ± 6.0% and the average PE was 34° ± 5°. This is
similar to the results from Middlebrooks (1999b), who tested 14
listeners in a virtual condition using DTFs. His average QE was
7.7% ± 8.0% and the average PE was 29° ± 5°.

In the model, targets in the lateral range of ±30° were consid- 
ered in order to match the lateral range of the actual targets from 
the localization experiments. For each listener, PMVs were calcu- 
lated for three lateral segments with a lateral width of 20° each, 
and these PMVs were evaluated corresponding to the actual lat- 
eral target angles. The QE was the sum of the corresponding PMV 
entries outside the local polar range for which the response-target 
distance exceeded 90°. The PE was the discrete expectancy value 
within the local polar range. Both errors were calculated as the 
arithmetic averages across all polar target angles considered. 
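
For a single target angle, the predicted errors can be sketched from a PMV as follows. Two points are assumptions made for this illustration: the PMV is renormalized within the local polar range before computing the PE, and the "discrete expectancy value" is interpreted as the probability-weighted RMS of the polar errors. The function name and toy values are hypothetical.

```python
import numpy as np

def predicted_errors(pmv, response_angles, target_angle):
    """Return (QE in percent, PE in degrees) predicted for one target angle."""
    pmv = np.asarray(pmv, dtype=float)
    err = (np.asarray(response_angles, dtype=float) - target_angle + 180.0) % 360.0 - 180.0
    local = np.abs(err) <= 90.0                    # local polar range
    qe = 100.0 * np.sum(pmv[~local])               # probability mass outside it
    p_local = pmv[local] / np.sum(pmv[local])      # assumed renormalization
    pe = np.sqrt(np.sum(p_local * err[local] ** 2))
    return qe, pe

# Arbitrary toy PMV over four response angles for a frontal target at 0 degrees.
response_angles = [-30.0, 0.0, 30.0, 180.0]
pmv = [0.1, 0.6, 0.2, 0.1]
qe_pred, pe_pred = predicted_errors(pmv, response_angles, target_angle=0.0)
print(qe_pred, pe_pred)
```

Averaging these values across all considered target angles would then give the listener-level predictions described above.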

3. RESULTS AND DISCUSSION 
3.1. MODEL CALIBRATION 

In Baumgartner et al. (2013), the model was calibrated individ- 
ually for each listener by finding the uncertainty U providing 
the smallest residual in the predictions as compared to the actual 
performance obtained in the localization experiments. 

In our study, this calibration process was revisited. For each 
listener and all target directions, PMVs were calculated for vary- 
ing uncertainty U ranging from 0.1 to 4.0 in steps of 0.1. 
Listener-specific DTFs were used for both the template set and 
incoming sound. Figure 2 shows PMVs and the actual local- 
ization responses for four exemplary listeners and exemplary 
uncertainties. 

For each listener, the predicted PEs and QEs were calculated
from the PMVs, and the actual PEs and QEs were calculated
from the experimental results. Figure 3 shows the predicted QEs
and PEs as a function of the uncertainty for the four exemplary
listeners. The symbols show the actual QEs and PEs.

FIGURE 2 | Actual and modeled localization. Actual localization responses (circles) and modeled response probabilities (PMVs, brightness encoded)
calculated for three uncertainties U and four exemplary listeners indexed by k.

FIGURE 3 | Predicted localization performance depends on the
uncertainty. PEs and QEs are shown as functions of U for four exemplary
listeners (k = 3: blue squares, k = 9: red triangles, k = 12: green diamonds,
k = 15: black circles). Lines show the model predictions. Symbols show the
actual performance obtained in the localization experiment (placement on
the abscissa corresponds to the optimal listener-specific uncertainty U_k).

Table 1 | Uncertainty U_k of individual listener with index k.

FIGURE 4 | Predicted versus actual localization performance.
Predicted PEs and QEs are shown as functions of the actual PEs
and QEs, respectively, for each listener. (A) Optimal listener-specific
uncertainties U_k. (B) Listener-constant uncertainty yielding best
correlation for PE, U = 2.89. (C) Listener-constant uncertainty yielding
best correlation for QE, U = 1.87. (D) Listener-constant uncertainty
from Langendijk and Bronkhorst (2002), U = 2.0. (E) Listener-specific
uncertainties U_k and the same DTF set (k = 14) for all listeners
(see Section 3.3 for more details). The correlation coefficient is
denoted by r.

In Baumgartner et al. (2013), the uncertainty yielding the
smallest squared sum of residues between the actual and predicted
performances (PE and QE) was considered as optimal.
Using the same procedure, the optimal uncertainties U_k were
calculated for each listener k and are shown in Table 1. For the
listener group, the average listener-specific uncertainty amounted
to 2.05 (SD = 0.37).
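
The calibration can be sketched as a grid search over U from 0.1 to 4.0 in steps of 0.1. In this illustration, `predict` is a hypothetical stand-in for the model (it returns a predicted PE and QE for a given U), and the cost is assumed to be the unweighted sum of the squared PE and QE residuals; the actual weighting of the two residuals is not specified in the text.

```python
import numpy as np

def calibrate_uncertainty(predict, actual_pe, actual_qe, grid=None):
    """Return the U on the grid that minimizes the squared residuals.

    predict: callable U -> (predicted PE, predicted QE); hypothetical interface
    """
    if grid is None:
        grid = np.arange(1, 41) / 10.0  # 0.1, 0.2, ..., 4.0
    cost = []
    for u in grid:
        pe_pred, qe_pred = predict(u)
        cost.append((pe_pred - actual_pe) ** 2 + (qe_pred - actual_qe) ** 2)
    return float(grid[int(np.argmin(cost))])

def toy_predict(u):
    """Toy predictor, NOT the localization model: errors grow linearly with U."""
    return 20.0 + 5.0 * u, 4.0 * u

u_opt = calibrate_uncertainty(toy_predict, actual_pe=30.0, actual_qe=8.0)
print(u_opt)
```

With the toy predictor, the search recovers the U at which both residuals vanish; with the real model, the minimum would generally be a compromise between the PE and QE residuals.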

With the optimal listener-specific uncertainties from Table 1, 
predictions were compared to the actual localization perfor- 
mances. Figure 4A shows the correspondence between the actual 
and predicted QEs and PEs of all listeners when using those 
listener-specific uncertainties. For the listener group, the corre- 
lation coefficient between actual and predicted localization errors 
was 0.88 for PE and 0.97 for QE. In Baumgartner et al. (2013), 
the model calibrated with those optimal uncertainties was evalu- 
ated in further conditions involving DTF modifications yielding 
correlation coefficients in the range of 0.75. 

3.2. NON-ACOUSTIC FACTOR: LISTENER-SPECIFIC UNCERTAINTY 

In Baumgartner et al. (2013), the optimal listener-specific uncertainties
were assumed to yield the most accurate performance predictions.
In Langendijk and Bronkhorst (2002), the effect of spectral
cues was modeled by using a parameter corresponding to our
uncertainty. Interestingly, that parameter was constant for all listeners,
and the impact of making the uncertainty listener-specific has not
yet been clarified. Thus, in this section, we investigate the effect of
uncertainty being listener-specific as compared to uncertainty
being constant for all listeners, when using the model from
Baumgartner et al. (2013).

Predictions were calculated with a model calibrated to an uncertainty
that was constant for all listeners. Three uncertainties were
used: (1) U = 2.89, which yielded the largest correlation with the
actual PEs of the listeners, (2) U = 1.87, which yielded the largest
correlation with the actual QEs, and (3) U = 2.0, which corresponds
to that used in Langendijk and Bronkhorst (2002). The
DTFs used for the incoming sound and the template set were 
still listener-specific, representing the condition of listening with 
own ears. The predictions are shown in Figures 4B-D. The cor- 
responding correlation coefficients are shown as insets in the 
corresponding panels. From this comparison and the compar- 
ison to that for listener-specific uncertainties (Figure 4A), it is 
evident that listener-specific calibration is required to account for 
the listener-specific actual performance. 

Our findings are consistent with the results from Langendijk 
and Bronkhorst (2002) who used a constant calibration for all 
listeners. The focus of that study was to investigate the change 
in predictions caused by the variation of spectral cues. Thus, 
prediction changes for different conditions within an individ- 
ual listener were important, which, in the light of the model 
from Baumgartner et al. (2013), correspond to the variation of 
the DTFs used for the incoming sound and not to the varia- 
tion of the uncertainty. U = 2.0 seems to be indeed an adequate 
choice for predictions for an "average listener". This is supported
by the similar average uncertainty of our listener group
(U = 2.05). It is further supported by the performance predicted
with U = 2.0, which was similar to the actual group performance.
For accurate listener-specific predictions, however, listener-specific
uncertainty is required.

The listener-constant uncertainty seems to have largely 
reduced the predicted performance variability in the listener 
group. In order to quantify this observation, the group SDs were 
calculated for predictions with listener-constant U from 1.1 to 2.9 
in steps of 0.1 for each listener. For PE, the group SD was 0.96° ± 
0.32°. For QE, the group SD was 1.34% ± 0.87%. For comparison,
the group SD for predictions with listener-specific uncertainties
was 4.58° and 5.07% for PE and QE, respectively, i.e., more than three
times larger than those for predictions with the listener-constant
uncertainties.

In summary, the listener-specific uncertainty seems to be 
vital to obtain accurate predictions of the listeners' actual per- 
formance. The listener-constant uncertainty drastically reduced 
the correlation between the predicted and actual performance. 
Further, the listener-constant uncertainty reduced the group vari- 
ability in the predictions. Thus, as the only parameter varied in 
the model, the uncertainty seems to determine to a large degree 
the baseline performance predicted by the model. It can be inter- 
preted as a parameter calibrating the model in order to represent 
a good or bad localizer; the smaller the uncertainty, the better the 
listeners' performance in a localization task. Notably, uncertainty 
is not associated with any acoustic information considered in the 
model, and thus, it represents the non-acoustic factor in modeling 
sound localization. 

3.3. ACOUSTIC FACTOR: LISTENER-SPECIFIC DIRECTIONAL CUES 

In the previous section, the model predictions were calculated
for the listeners' own DTFs in both the template set and the incoming
sound, a condition corresponding to listening with their own ears.
With the DTFs of other listeners but their own uncertainty, their
performance might have been different.

For the investigation of that effect, one possibility would be 
to vary the quality of the DTF sets along a continuum simulta- 
neously in both the incoming sound and the template set, and 
analyze the corresponding changes in the predictions. Such an 
investigation would be, in principle, similar to that from the 
previous section where the uncertainty was varied and the pre- 
dicted performance was analyzed. While U represents a measure 
of the uncertainty, a similar metric would be required in order to 
quantify the quality differences between two DTF sets. Finding 
an appropriate metric is challenging. A potentially useful metric
is the spectral SD of inter-spectral differences (Middlebrooks,
1999b; Langendijk and Bronkhorst, 2002), which is used in the model
from Baumgartner et al. (2013) as the distance metric and thus
as the basis for the predictions. Being a part of the model, however,
this metric cannot serve as an independent factor in our
investigation.
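
For illustration, the spectral SD of inter-spectral differences can be sketched as below. This is a simplified sketch, not the implementation used in the model: it assumes two magnitude spectra given in dB on the same frequency bands, uses the population SD, and omits the auditory filtering and frequency-range choices of the actual model. The function name is hypothetical.

```python
from statistics import pstdev


def spectral_sd_of_isd(spectrum_a_db, spectrum_b_db):
    """SD across frequency bands of the dB difference between two spectra."""
    if len(spectrum_a_db) != len(spectrum_b_db):
        raise ValueError("spectra must have the same number of bands")
    # Inter-spectral difference (ISD) per frequency band, in dB.
    isd = [a - b for a, b in zip(spectrum_a_db, spectrum_b_db)]
    # Population SD is an assumption; a sample SD would differ slightly.
    return pstdev(isd)


template = [0.0, 3.0, -6.0, 2.0]
same_shape_louder = [level + 5.0 for level in template]
different_shape = [1.0, 2.0, -5.0, 1.0]
```

A constant level offset leaves the metric at zero, because only the shape of the difference across frequency contributes to the SD.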

In order to analyze the DTF set variation as a parameter with- 
out any need for quantification of the variation, we systematically 
replaced the listeners' own DTFs by DTFs from other listeners 
from this study. The permutation of the DTF sets and uncer- 
tainties within the same listener group allowed us to estimate the 
effect of directional cues relative to the effect of uncertainty on 
the localization performance of our group. 

For each listener, the model predictions were calculated using
a combination of DTF sets and uncertainties of all listeners from
the group. Indexing each listener by k, predicted PEs and QEs
as functions of Uk and Dk were obtained, with Uk and Dk being
the uncertainty and the DTF set, respectively, of the k-th listener.
Figure 5 shows the predicted PEs and QEs for all combinations
of Uk and Dk. The listener group was sorted such that the uncertainty
increases with increasing k and the same sorting order was
used for Dk. This sorting order corresponds to that from Table 1.

FIGURE 5 | Localization performance depends on the uncertainty and
DTF set. Predicted PEs and QEs as functions of the uncertainty of the k-th
listener (Uk) and the DTF set of the k-th listener (Dk).

The results reflect some of the effects described in the previous
sections. The main diagonal represents the special case of identical
k for Dk and Uk, corresponding to listener-specific performance,
i.e., predictions for each listener's actual DTFs and optimal
listener-specific uncertainty from the calibrated model described
in Section 3.1. Each row, i.e., constant Dk but varying Uk,
represents the listener-specific effect of the uncertainty described
in Section 3.2, i.e., listening with own ears but having different
uncertainties.
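
The layout of the prediction grid can be sketched as follows. The predictor `toy_predict_pe` and the listener values are hypothetical placeholders for the actual model and data. Only the indexing convention (rows for constant Dk, columns for constant Uk, main diagonal for listener-specific predictions) follows the description above.

```python
def toy_predict_pe(dtf_offset, uncertainty):
    """Hypothetical stand-in for the localization model (NOT the real model)."""
    return 20.0 + 8.0 * uncertainty + dtf_offset


def prediction_grid(dtf_sets, uncertainties, predict=toy_predict_pe):
    """grid[i][j] is the prediction for DTF set i (row) and uncertainty j (column)."""
    return [[predict(d, u) for u in uncertainties] for d in dtf_sets]


# Made-up listeners, sorted by ascending uncertainty; same order for DTF sets.
uncertainties = [1.6, 2.0, 2.4]
dtf_sets = [0.0, 1.5, -1.0]

grid = prediction_grid(dtf_sets, uncertainties)
diagonal = [grid[k][k] for k in range(len(grid))]  # own DTFs, own uncertainty
row_0 = grid[0]                                    # D_0 fixed, uncertainty varied
column_0 = [row[0] for row in grid]                # U_0 fixed, DTF set varied
```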

In this section, we focus on the results in the columns. Each 
column describes results for a constant Uk but varying Dk, rep- 
resenting the listener-specific effect of the DTF set. While the 
predictions show a variation across both columns and rows, 
i.e., substantial effects of both uncertainty and DTF set, some 
DTF sets show clear differences to others across all uncertainties. 
This analysis is, however, confounded by the different baseline
performance of each listener and can be improved by considering
the performance relative to the listener-specific performance.
Figure 6 shows ΔPEs and ΔQEs, i.e., PEs and QEs relative to
the listener-specific PEs and QEs, respectively, averaged over all
uncertainties for each DTF set Dk. Positive values represent the
amount by which the performance of our listener group would
deteriorate when listening with the DTF set of the k-th listener (and
being fully re-calibrated). For example, the DTF sets of listeners
k = 9 and k = 15 show such deteriorations. Those DTF sets
seem to have provided less accessible directional cues. Further,
DTF sets improving the performance for the listeners can be
identified, see, for example, the DTF sets of listeners k = 3 and k = 12.
These DTF sets seem to have provided more accessible directional
cues. The effect of those four DTF sets can also be examined in
Figure 2 by comparing the predictions for constant uncertainties,
i.e., across rows.
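
One possible reading of the relative measure is sketched below: for DTF set i and the uncertainty of listener j, the listener-specific prediction of listener j (the diagonal entry) is subtracted, and the differences are averaged over all uncertainties. The grid values are invented, and the exact averaging and the sample SD for the whiskers are assumptions.

```python
from statistics import mean, stdev

# Made-up PE grid in degrees: rows are DTF sets, columns are uncertainties.
grid = [
    [30.0, 34.0, 38.0],
    [32.0, 36.0, 40.0],
    [29.0, 33.0, 37.0],
]


def delta_per_dtf_set(grid):
    """Mean and SD of (prediction minus listener-specific prediction) per DTF set."""
    n = len(grid)
    result = []
    for i in range(n):
        # grid[j][j] is the listener-specific baseline of listener j.
        deltas = [grid[i][j] - grid[j][j] for j in range(n)]
        result.append((mean(deltas), stdev(deltas)))
    return result


deltas = delta_per_dtf_set(grid)
```

In this toy grid, the second DTF set has a positive mean difference (worse performance for the group) and the third a negative one (better performance).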

Thus, variation of the DTF sets had an effect on the predictions,
suggesting that it also affects the comparison of the
predictions with the actual performance. This leads to the question
to what extent a constant DTF set across all listeners can
explain the actual performances. It might even be the case that
listener-specific DTFs are not required for accurate predictions.
Similarly to the analysis from Section 3.2, where the impact
of listener-specific uncertainty was related to that of a
listener-constant uncertainty, here, we compare the impact of
listener-specific DTF sets relative to that of a listener-constant DTF set.
To this end, predictions were calculated with a model calibrated 
to the same DTF set for all listeners but with a listener-specific 
uncertainty. All DTF sets from the pool of available listeners 
were tested. For each of the DTF sets, correlation coefficients 
between the actual and predicted performances were calculated. 
The correlation coefficients averaged over all DTF sets were 0.86 
(SD = 0.007) for PE and 0.89 (SD = 0.006) for QE. Note the 
extremely small variability across the different DTF sets, indi- 
cating only little impact of the DTF set on the predictions. 
The DTF set from listener k = 14 yielded the largest correlation 
coefficients, which were 0.87 for PE and 0.89 for QE. The cor- 
responding predictions as functions of the actual performance 
are shown in Figure 4E. Note the similarity to the predictions 
for the listener-specific DTF sets (Figure 4A). These findings 
have a practical implication when modeling the baseline performance
of sound localization: for an arbitrary listener, the DTFs of
another arbitrary listener, e.g., NH68 (k = 14), might still yield
listener-specific predictions.
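
The correlation analysis with a listener-constant DTF set can be sketched as follows. All numbers are invented, and the Pearson correlation coefficient is assumed as the correlation measure. Each row of the grid holds the predictions for all listeners obtained with one DTF set and the listeners' own uncertainties.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up actual PEs (degrees) of three listeners.
actual_pe = [31.0, 35.0, 38.5]

# Made-up predictions: row i uses DTF set i for every listener,
# column j uses the uncertainty of listener j.
grid = [
    [30.0, 34.0, 38.0],
    [32.0, 36.0, 40.0],
    [29.0, 33.0, 37.0],
]

# One correlation coefficient per listener-constant DTF set.
r_per_dtf_set = [pearson_r(actual_pe, row) for row in grid]
r_average = mean(r_per_dtf_set)
```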

Recall that in our investigation, both the incoming sound and 
the template set were filtered by the same DTF set, correspond- 
ing to a condition where the listener is completely re-calibrated 
to those DTFs. The highest correlation found for NH68's DTF 
set does not imply that this DTF set is optimal for ad-hoc 
listening. 

In summary, the predicted localization performance varied 
by a small amount depending on the directional cues provided 
by the different DTF sets, even when the listener-specific uncer- 
tainty was considered. Note that full re-calibration was simulated. 
This finding indicates that some of the DTF sets provide better 
access to directional cues than others. Even though the acoustic
factor might contribute to the variability in localization performance
across listeners, using the same DTF set of a single listener (here,
NH68) for modeling the performance of all listeners still yielded
good prediction accuracy.

3.4. RELATIVE CONTRIBUTIONS OF ACOUSTIC AND NON-ACOUSTIC 
FACTORS 

Both the DTF set and the uncertainty had an effect on the 
predicted localization performance. However, a listener-constant
DTF set still provided acceptable predictions, while a listener-constant
uncertainty did not. In this section, we aim at directly
comparing the relative contributions of the two factors to local- 
ization performance. To this end, we compare the SDs in the 
predictions as a function of each of the factors. The factor causing 
more variation in the predictions is assumed to have more impact 
on sound localization. 

FIGURE 6 | Listener-specific performance depends on the DTF set used
in the model. ΔPEs and ΔQEs averaged over all Uk as a function of Dk.
ΔPEs and ΔQEs are the PEs and QEs relative to the listener-specific PEs
and QEs, respectively. The whiskers show ±1 SD.

FIGURE 7 | DTF set contributes less than uncertainty to the
performance variability of the group. PE SDs and QE SDs as functions of
either listener-constant DTF set calculated for listener-specific uncertainties
(Uk varied, blue squares) or the listener-constant uncertainty calculated for
listener-specific DTF sets (DTF varied, red circles). The abscissa (listener k)
is sorted by the ascending listener-specific uncertainty Uk.

We used PEs and QEs predicted for all combinations of uncertainties
and DTF sets, as shown in Figure 5. For each listener
and each performance metric, two SDs were calculated: (1) as a
function of the listener-specific DTF set Dk for all available uncertainties,
i.e., calculating the SDs across a column separately for
each row; and (2) as a function of the listener-specific uncertainty
Uk for all available DTF sets, i.e., calculating the SD across
a row separately for each column. Figure 7 shows these SDs as
functions of the k-th listener, sorted by ascending listener-specific
uncertainty. When Uk was varied, the average SD across listeners
was 4.4° ± 0.3° and 5.1% ± 0.4% for PE and QE, respectively.
When the DTF set was varied, the average SD was 1.2° ± 0.1° and
1.9% ± 0.3% for PE and QE, respectively. On average, the factor
uncertainty caused more than twice as much variability as the
factor DTF set.
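
The comparison of the two factors can be sketched as below. The grid values are invented and the sample SD is assumed. Following the Figure 7 caption, the sketch takes, for listener k, the SD along row k (Dk fixed, uncertainty varied) and the SD along column k (Uk fixed, DTF set varied). This row and column assignment is an interpretation of the description above.

```python
from statistics import mean, stdev

# Made-up PE grid in degrees: rows are DTF sets, columns are uncertainties.
grid = [
    [30.0, 34.0, 38.0],
    [32.0, 36.0, 40.0],
    [29.0, 33.0, 37.0],
]


def sd_by_factor(grid):
    """Per listener k: SD with uncertainty varied and SD with DTF set varied."""
    n = len(grid)
    # Row k: DTF set of listener k fixed, uncertainty varied.
    sd_u = [stdev(grid[k]) for k in range(n)]
    # Column k: uncertainty of listener k fixed, DTF set varied.
    sd_dtf = [stdev([grid[i][k] for i in range(n)]) for k in range(n)]
    return sd_u, sd_dtf


sd_u_varied, sd_dtf_varied = sd_by_factor(grid)
# Ratio of the average variability caused by each factor.
ratio = mean(sd_u_varied) / mean(sd_dtf_varied)
```

The factor with the larger average SD is the one taken to have more impact on the predicted performance. In this toy grid that is the uncertainty.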

This analysis shows that while both listener-specific uncer- 
tainty and listener-specific DTF set were important for the 
accuracy in predicted localization performance, the uncertainty 
affected the performance much more than the DTF set. This 
indicates that the non-acoustic factor, uncertainty, contributes
more than the acoustic factor, DTF set, to the localization performance.
This is consistent with the observations of Andéol et al.
(2013), where localization performance correlated with the detec- 
tion thresholds for spectral modulation, but did not correlate 
with the prominence of the HRTF's spectral shape. The direc- 
tional information captured by the spectral shape prominence 
corresponds to the acoustic factor in our study. The sensitivity 
to the spectral modulations represents the non-acoustic factor in 
our study. Even though the acoustic factor (DTF set) contributed 
to the localization performance of an individual listener, the dif- 
ferences between the listeners seem to be more determined by a 
non-acoustic factor (uncertainty). 

Note that the separation of the sound localization process into 
acoustic and non-acoustic factors in our model assumes a per- 
fect calibration of a listener to a DTF set. It should be considered, 
though, that listeners might actually be calibrated at different 
levels to their own DTFs. In such a case, the potentially differ- 
ent levels of calibration would be implicitly considered in the 
model by different uncertainties, confounding the interpretation 
of the relative contribution of the acoustic and non-acoustic factors.
While the general capability to re-calibrate to a new DTF set
has been investigated quite well (Hofman and van Opstal, 1998;
Majdak et al., 2013), the level of calibration to one's own DTF set
has not been clarified yet.

4. CONCLUSIONS 

In this study, a sound localization model predicting the localiza- 
tion performance in sagittal planes (Baumgartner et al., 2013) 
was applied to investigate the relative contributions of acoustic 
and non-acoustic factors to localization performance in the lateral 
range of ±30°. The acoustic factor was represented by the direc- 
tional cues provided by the DTF sets of individual listeners. The 
non-acoustic factor was represented by the listener-specific uncer- 
tainty considered to describe processes related to the efficiency of 
processing the spectral cues. Listener-specific uncertainties were 
estimated in order to calibrate the model to the actual performance
when localizing broadband noises with their own ears. Then,
predictions were calculated for the permutation of DTF sets and
uncertainties across the listener group. Identical DTF sets were
used for the incoming sound and the template set, which allowed
us to simulate the listeners being completely re-calibrated to the
tested DTF sets, a condition nearly unachievable in psychoacoustic
localization experiments.

Our results show that both the acoustic and non-acoustic 
factors affected the modeled localization performance. The non- 
acoustic factor had a strong effect on the predictions, and 
accounted very well for the differences between the individual lis- 
teners. In comparison, the acoustic factor had much less effect on 
the predictions. In an extreme case of using the same DTF set for 
modeling performance for all listeners, an acceptable prediction 
accuracy was still obtained. 

Note that our investigation considered only targets positioned 
in sagittal planes of ±30° around the median plane. Even though 
we do not have evidence for contradicting conclusions for more 
lateral sagittal planes, one should be careful when applying our 
conclusions to more lateral targets. Further, the model assumes 
direction-static and stationary stimuli presented in the free field. 
In realistic listening situations, listeners can move their head, 
the acoustic signals are temporally fluctuating, and reverberation 
interacts with the direct sound. 

An unexpected conclusion from our study is that, globally, 
i.e., on average across all considered directions, all the tested 
DTF sets encoded the directional information similarly well. It
seems that listener-specific DTFs are not necessarily required
for predicting the global listener-specific localization ability in 
terms of distinguishing between bad and good localizers. What 
seems to be required, however, is an accurate estimate of the 
listener-specific uncertainty. One could speculate that, given a 
potential relation between the uncertainty and a measure of 
spectral-shape sensitivity, in the future, the global listener-specific 
localization ability might be predictable by obtaining a measure 
of the listener-specific uncertainty in a non-spatial experimen- 
tal task without any requirement of listener-specific localization 
responses. 